MD5-related optimizations by Chainfire · Pull Request #6 · RsyncProject/rsync

Chainfire · 2020-05-31T12:54:46Z

This is a 3-parter. The first commit moves the OpenSSL-related defines from checksum.c to mdigest.h, because I otherwise found myself copy/pasting them more than once.
The second commit enables parallel computation of MD5 hashes in the block matching get_checksum2() phase. As each block's hash is independent, we can process up to 4 blocks simultaneously with SSE2, and 8 blocks with AVX2, leading to a real-world 2x to 6x performance gain / CPU usage reduction (even over OpenSSL-optimized MD5).

However, to make this happen without significant changes to the rest of rsync's codebase, a block prefetcher had to be created. (This whole commit requires --enable-simd, like my previous contributions.) Full compatibility is maintained with non-SIMD counterparts.
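To illustrate the prefetching idea, here is a minimal sketch in plain C. It is not the code from this patch: `block_cache`, `cache_init()`, `cached_block_digest()` and `toy_digest()` are hypothetical names, and `toy_digest()` is a 32-bit FNV-1a stand-in rather than MD5. The sketch only shows the shape of the trick: the caller still asks for one block at a time, and a miss triggers a batch over the next few blocks on the assumption that they will be requested next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* 4 with SSE2 or 8 with AVX2, per the description above */

/* Hypothetical stand-in for a per-block digest (32-bit FNV-1a), NOT MD5. */
static uint32_t toy_digest(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical cache holding the digests of one batch of consecutive blocks. */
typedef struct {
    const unsigned char *data;
    size_t data_len, block_len;
    size_t first_block;     /* index of the first cached block */
    size_t count;           /* number of valid entries in digest[] */
    uint32_t digest[LANES];
    unsigned batches;       /* batch computations so far, for illustration */
} block_cache;

static void cache_init(block_cache *c, const unsigned char *data,
                       size_t data_len, size_t block_len)
{
    assert(block_len > 0);
    c->data = data;
    c->data_len = data_len;
    c->block_len = block_len;
    c->first_block = 0;
    c->count = 0;
    c->batches = 0;
}

/* Return the digest of block idx. On a miss, assume the caller will ask for
 * the following blocks next and digest up to LANES blocks in one batch.
 * A SIMD build would run the lanes of that batch simultaneously; this
 * reference loop runs them one after another. */
static uint32_t cached_block_digest(block_cache *c, size_t idx)
{
    size_t nblocks = (c->data_len + c->block_len - 1) / c->block_len;
    assert(idx < nblocks);
    if (idx < c->first_block || idx >= c->first_block + c->count) {
        c->first_block = idx;
        c->count = 0;
        for (size_t lane = 0; lane < LANES && idx + lane < nblocks; lane++) {
            size_t off = (idx + lane) * c->block_len;
            size_t left = c->data_len - off;
            size_t len = left < c->block_len ? left : c->block_len;
            c->digest[lane] = toy_digest(c->data + off, len);
            c->count++;
        }
        c->batches++;
    }
    return c->digest[idx - c->first_block];
}
```

With 10 blocks requested in order and `LANES` set to 4, this sketch runs three batches (blocks 0-3, 4-7, 8-9) instead of ten separate digest calls. A wrong prediction only costs a wasted batch, since any miss simply starts a new one.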

The same mechanism could be used for multithreading checksums as well, but that is beyond the scope of this patch.

The third commit provides the MD5P8 whole-file checksum. This is an MD5-based checksum that cuts the input into 8 independent streams (64-byte interleave), whose final states are brought together to create the final MD5 checksum. This has the same strengths (and weaknesses) as the normal MD5 checksum, but allows parallel processing, with again a 2x to 6x performance gain.
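As a rough picture of the 64-byte interleave, the sketch below maps input offsets to streams. It assumes plain round-robin assignment of 64-byte chunks to the 8 streams and that a trailing partial chunk goes to whichever stream is next in line; the patch's actual tail handling and the way the 8 final states are combined are not covered here. `stream_for_offset()` and `stream_lengths()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define STREAMS 8
#define CHUNK 64 /* interleave granularity; also the MD5 block size */

/* Stream that would receive the byte at `offset`, ASSUMING round-robin
 * assignment of 64-byte chunks to the 8 streams. */
static unsigned stream_for_offset(size_t offset)
{
    return (unsigned)((offset / CHUNK) % STREAMS);
}

/* Number of bytes each stream would receive for an input of total_len
 * bytes under the same assumption. The partial last chunk, if any, is
 * counted for the stream that is next in the rotation. */
static void stream_lengths(size_t total_len, size_t out[STREAMS])
{
    assert(out != NULL);
    size_t full_chunks = total_len / CHUNK;
    size_t tail = total_len % CHUNK;
    for (unsigned s = 0; s < STREAMS; s++) {
        size_t chunks = full_chunks / STREAMS
                      + (s < full_chunks % STREAMS ? 1 : 0);
        out[s] = chunks * CHUNK;
    }
    out[full_chunks % STREAMS] += tail;
}
```

For a 1000-byte input this gives 15 full chunks plus a 40-byte tail: streams 0 to 6 get 128 bytes each and stream 7 gets 64 + 40 = 104 bytes. Because each stream only sees its own chunks, the 8 MD5 states never depend on one another, which is what lets them advance in lockstep across SIMD lanes.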

Both an optimized parallel implementation (--enable-simd) and a reference C implementation are available. MD5P8 is slightly slower on <10kB files due to the additional overhead, but similar to MD5 on larger files without SIMD, and much faster on larger files with SIMD.

Note that get_checksum2() keeps using normal MD5 even if whole-file checksum is MD5P8, because that is parallelized with SIMD anyway if available, and using MD5P8 would just add overhead and quite probably be slower.

Further note that the CSUM_MD5 and CSUM_MD5P8 defines now appear in both checksum.c and simd-checksum-x86_64.cpp and need to be kept in sync. Perhaps they should be moved to a header?

Motivation: though xxhash is now available for rsync, it is not included in the code itself but is an external dependency. By my last evaluation, many distros do not yet come with xxhash included, and thus the distro-included rsync package will be built without xxhash support. That being said, I can imagine that this PR may not be merged because it is not part of the direction rsync is moving in. I need to use it myself because I have an uncommon build target, and the code might as well be available to everyone.

The parallel computation of MD5 hashes in get_checksum2() will benefit connections to both recent builds without xxhash and older builds of rsync, provided the block-matching phase applies. If it doesn't lead to a reduction in transfer time due to connection or disk speed limitations, then it will at least massively reduce CPU usage on the supporting client.

The use-case for MD5P8 is more limited, as its usefulness requires both ends to be running a supporting rsync build, but one end not supporting xxhash. If both ends do support xxhash, that should always be the preferred checksum (while MD5P8 can reach gigabytes per second, xxhash is still twice as fast). I only created it as it was a small effort now that parallel MD5 computation was available anyway, and it doesn't have any external dependencies.

I've done some benchmarks for transferring 1GB files between a fast and a slow CPU on 1GbE LAN, compared to normal MD5 usage (all tests already including my previous block size patches and get_checksum1() optimizations):

get_checksum2() MD5 parallelization with MD5 whole-file checksum, both files existing on both ends:

33% transfer time reduction
52% CPU usage reduction

get_checksum2() MD5 parallelization and MD5P8 whole-file checksum, both files existing on both ends:

54% transfer time reduction
84% CPU usage reduction

xxhash for both get_checksum2() and whole-file checksum, both files existing on both ends:

54% transfer time reduction
90% CPU usage reduction

MD5P8, new file:

33% transfer time reduction
86% CPU usage reduction

xxhash, new file:

33% transfer time reduction
92% CPU usage reduction

MD5P8, local checksum:

83% CPU usage reduction

xxhash, local checksum:

94% CPU usage reduction

Obviously these are highly specific to my setup and YMMV. However, my daily syncing of TBs of data is now twice as fast, with average CPU usage down to less than a quarter. xxhash doesn't run ahead much in this case because CPU power while checksumming is no longer the bottleneck after these patches. With even faster network and disks (10GbE + NVMe) xxhash might be twice as fast.
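For reading the percentages above as "times faster" factors, the conversion is 100 / (100 - reduction). The helper below is only that arithmetic; `speedup_from_reduction()` is a hypothetical name and not part of the patch.

```c
#include <assert.h>

/* Convert a percentage reduction in time or CPU usage into the equivalent
 * speedup factor: a 50% reduction means the work took half as long, i.e. 2x. */
static double speedup_from_reduction(double percent_reduction)
{
    assert(percent_reduction >= 0.0 && percent_reduction < 100.0);
    return 100.0 / (100.0 - percent_reduction);
}
```

For example, the 54% transfer time reduction works out to roughly 2.17x, and the 84% CPU usage reduction to 6.25x. These are end-to-end figures for whole transfers, so they are not directly the same measurement as the hashing speedup itself.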

Works just as well, prevents having to repeat them across files

MD5 hashes computed during rsync's block matching phase are independent and thus possible to process in parallel. This code processes 4 blocks in parallel if SSE2 is available, or 8 if AVX2 is available. An increase of performance (or decrease of CPU usage) of up to 6x has been measured. A prefetching algorithm is used to predict and load upcoming blocks, as this prevents the need for extensive modifications to other parts of the rsync sources to get this working.

Splits the input up into 8 independent streams (64-byte interleave), and produces a final checksum based on the end state of those 8 streams. If parallelization of MD5 hashing is available, the performance gain is 2x to 6x. xxHash is still preferred (and faster), but this provides a reasonably fast fallback for the case where xxHash libraries are not available at build time.

WayneD · 2020-06-02T02:11:02Z

Thanks! I've put the changes into a file named "md5p8.diff" in the rsync-patches repo for now. I incorporated some of the changes that put more info into lib/mdigest.h, and I tweaked a few things for style and to fix a compiler warning. Here's the resulting patch:

https://git.samba.org/?p=rsync-patches.git;a=blob_plain;f=md5p8.diff;hb=ac98f867ff5e7e53a0157b967c7b216c86b0b0a6

WayneD · 2020-06-08T02:48:31Z

I'm going to leave it as a maintained patch for now and consider merging it later.

Chainfire · 2020-06-19T08:41:05Z

I'll update this with the new build tests and apply it to the latest master.

rsync.exe -av <local> user@host:/dst/ now transfers files over SSH with byte-exact verification. Idempotent re-push transfers 0 bytes.

Four fixes to clear the runtime path after build came up:

* win32/win_select.c: select() shim. winsock's select() only handles SOCKETs; rsync's io.c calls select() on the pipe fds from piped_child. Classify each fd via GetFileType+GetNamedPipeInfo, defer sockets to real winsock select, poll pipes via PeekNamedPipe. 10 ms cadence. ~170 LOC.
* win_spawn.c: bump CreatePipe buffer hint to 1 MB so the file-list phase doesn't deadlock on a full 4 KB anonymous pipe.
* util1.c::change_dir: treat 'X:\…', 'X:/…', and '\…' as absolute on Windows. Normalize curr_dir to forward slashes after getcwd so path joins don't mix separators.
* syscall.c::do_open_nofollow: force O_BINARY (MSVC defaults to text mode); skip the lstat→open→fstat dev/ino symlink-race check on Windows because MSVC's stat/fstat don't return stable values for those fields.

Pull and local-copy still hit the RtlCloneUserProcess fork hang, tracked as task RsyncProject#6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ncProject#6 (chmod mode arg)

These two small syscall.c fixes were made at the start of the round-4 work but got dropped on the floor when I split the commit -- only the docs (RsyncProject#1) and the bigger RsyncProject#2/RsyncProject#4/RsyncProject#5 deferred-immutable-dir series ended up landed. The tree was left dirty.

RsyncProject#3: do_rename (the non-_at variant) was missing the hardlink-aware restore I added to do_rename_at last round. Same shape -- when renameat replaces a destination inode that had st_nlink > 1, the remaining hardlinks survive carrying the cleared flags. Restore via new_fd before close (the fd still refers to the surviving inode).

RsyncProject#6: do_chmod and do_chmod_at force_change recovery were calling make_mutable_fd(fd, mode, ...) where mode was the caller-supplied chmod-target mode -- some callers (notably xattrs.c's set_xattr recovery path) pass perm bits only, no S_IFREG / S_IFDIR, so on Linux rsync_fchflags rejects the call as neither regular file nor directory and recovery silently fails. Use st.st_mode from the freshly-fstatted target instead, which always has the right S_IFx bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Chainfire added 3 commits May 31, 2020 13:31

Move OpenSSL-related MD4/5 defines and imports to lib/mdigest.h

bfb44c2

Works just as well, prevents having to repeat them across files

WayneD closed this Jun 8, 2020

WayneD self-assigned this Jun 19, 2020

Chainfire mentioned this pull request Jun 19, 2020

SSE2/AVX2 optimized get_checksum2()/MD5 for x86-64, and MD5P8 whole-f… #23

Closed

dr-who mentioned this pull request Sep 27, 2021

Triple vvv and above hangs rsync #231

Open

Chainfire wants to merge 3 commits into RsyncProject:master from Chainfire:CSUM2-AVX2
